GASTS: Parsimony Scoring under Rearrangements
Abstract
The accumulation of whole-genome data has renewed interest in the study of genomic rearrangements. Comparative genomics, evolutionary biology, and cancer research all require models and algorithms to elucidate the mechanisms, history, and consequences of these rearrangements. However, rearrangements lead to NP-hard problems, so that current approaches, such as the MGR tool, are limited to small collections of genomes and low-resolution data of a few hundred syntenic blocks. We describe the first algorithm for rearrangement analysis that scales up, in both time and accuracy, to modern high-resolution genomic data. Our main contribution is GASTS, an algorithm for scoring a fixed phylogenetic tree: given a tree and a collection of genomes, one for each leaf of the tree, each genome given by an ordered list of syntenic blocks, GASTS infers genomes for the internal nodes of the tree so as to minimize the sum, taken over all tree edges, of the pairwise genomic distances between tree nodes. We present the results of extensive testing on both simulated and real data showing that our algorithm runs several orders of magnitude faster than existing approaches and scales up linearly instead of exponentially with the size of the genomes involved; on the small instances that current approaches can complete in a day, our algorithm also returns much better scores. In simulations, our tree scores stay within 0.5% of the model value for trees up to 100 taxa and genomes of up to 10,000 syntenic blocks. GASTS enables us to attack heretofore unapproachable problems, such as accurate ancestral reconstruction of large genomes and phylogenetic inference for high-resolution vertebrate genomes, as we demonstrate on a set of vertebrate genomes with over 2,000 syntenic blocks.
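The objective described in the abstract, the sum over all tree edges of the pairwise genomic distances, can be illustrated with a minimal sketch. The abstract does not say which genomic distance GASTS uses, so the breakpoint distance on signed linear genomes stands in here as an assumption. The names `adjacency_set`, `breakpoint_distance`, and `tree_score` are hypothetical, and the code only evaluates the score for internal genomes that are already given; it does not perform the inference itself.

```python
from typing import Callable, Dict, List, Set, Tuple

# A genome is an ordered list of signed syntenic block IDs, e.g. [1, -3, -2, 4].
Genome = List[int]
Edge = Tuple[str, str]


def adjacency_set(genome: Genome) -> Set[Tuple[int, int]]:
    """Collect the adjacencies of a signed linear genome.

    The pair (a, b) read on one strand equals (-b, -a) read on the other,
    so each adjacency is stored in a canonical orientation.
    """
    return {min((a, b), (-b, -a)) for a, b in zip(genome, genome[1:])}


def breakpoint_distance(g1: Genome, g2: Genome) -> int:
    """Count the adjacencies of g1 that are absent from g2 (assumed stand-in distance)."""
    return len(adjacency_set(g1) - adjacency_set(g2))


def tree_score(
    edges: List[Edge],
    genomes: Dict[str, Genome],
    distance: Callable[[Genome, Genome], int] = breakpoint_distance,
) -> int:
    """Sum the pairwise distance over every edge of the tree."""
    return sum(distance(genomes[u], genomes[v]) for u, v in edges)


# Three leaves attached to one internal node X (the smallest unrooted tree).
leaves = {
    "A": [1, 2, 3, 4],
    "B": [1, -3, -2, 4],  # A with blocks 2..3 inverted
    "C": [1, 2, -3, 4],   # A with block 3 inverted
}
edges = [("X", "A"), ("X", "B"), ("X", "C")]

# Compare two candidate labelings of the internal node.
score_if_x_is_a = tree_score(edges, {**leaves, "X": leaves["A"]})
score_if_x_is_b = tree_score(edges, {**leaves, "X": leaves["B"]})
print(score_if_x_is_a, score_if_x_is_b)
```

In this toy instance, labeling the internal node with genome A gives a lower score than labeling it with genome B, which is the kind of comparison a scoring algorithm has to make for every internal node at once.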
Similar resources
Linear Programming for Phylogenetic Reconstruction Based on Gene Rearrangements
Phylogenetic reconstruction from gene rearrangements has attracted increasing attention from biologists and computer scientists over the last few years. Methods used in reconstruction include distance-based methods, parsimony methods using sequence-based encodings, and direct optimization. The latter, pioneered by Sankoff and extended by us with the software suite GRAPPA, is the most accurate a...
Small Phylogeny Problem: Character Evolution Trees
Phylogenetics is the science of determining connections between groups of organisms in terms of ancestor/descendant relationships, usually expressed by phylogenetic trees, also called “trees of life”, cladograms, or dendrograms. In the parsimony approach to reconstructing phylogenetic trees, the goal is to find the most parsimonious tree, i.e., the tree requiring the smallest number/score of evolutio...
Phytotaxa 1: 3–20 (2009) Phylogenetic analysis of non-coding plastid DNA in the presence of short
The presence of short inversions in non-coding plastid DNA is universal and an integral part of its evolution. We studied the effect of these structural changes in phylogenetic inference by measuring phylogenetic accuracy as a topological congruence among topologies inferred from plastid and nuclear sequences in ferns (Lindsaea) and mosses (Brachytheciaceae). We randomly replicated ten subsets ...
Fast Phylogenetic Methods for the Analysis of Genome Rearrangement Data: An Empirical Study
Evolution operates on whole genomes through mutations that change the order and strandedness of genes within the genomes. Thus analyses of gene-order data present new opportunities for discoveries about deep evolutionary events, provided that sufficiently accurate methods can be developed to reconstruct evolutionary trees. In this paper we present two new methods of character coding for parsimo...
Parsimony via consensus.
The parsimony score of a character on a tree equals the number of state changes required to fit that character onto the tree. We show that for unordered, reversible characters this score equals the number of tree rearrangements required to fit the tree onto the character. We discuss implications of this connection for the debate over the use of consensus trees or total evidence and show how it ...